A novel speech emotion recognition method based on feature construction and ensemble learning

In the field of Human-Computer Interaction (HCI), speech emotion recognition technology plays an important role. To cope with the small amount of available speech emotion data, this paper proposes a novel speech emotion recognition method based on feature construction and ensemble learning. First, acoustic features are extracted from the speech signal and combined to form different original feature sets. Second, a novel feature selection method named L-SFS is proposed, based on the Light Gradient Boosting Machine (LightGBM) and the Sequential Forward Selection (SFS) method. Then, a softmax regression model is used to automatically learn the weights of four single weak learners: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Extreme Gradient Boosting (XGBoost), and LightGBM. Finally, based on the automatically learned weights and a weighted average probability voting strategy, an ensemble classification model named Sklex is constructed, which integrates the four weak learners. The method demonstrates the effectiveness of feature construction and the superiority and stability of ensemble learning, and achieves good speech emotion recognition accuracy.
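
The weighted average probability voting step can be illustrated with a minimal sketch. It is not the Sklex implementation: the probability vectors and raw scores below are hypothetical, the function names are chosen for illustration, and the softmax regression training that learns the weights in the paper is omitted. Only the weight normalization and the weighted vote are shown.

```python
import math


def softmax(scores):
    """Map raw scores to positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_vote(learner_probs, weights):
    """Combine per-learner class probabilities with a weighted average.

    learner_probs: one probability vector per learner, all of equal length.
    weights: one non-negative weight per learner, summing to 1.
    Returns (predicted_class_index, combined_probabilities).
    """
    n_classes = len(learner_probs[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, learner_probs))
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return predicted, combined


# Hypothetical outputs of four learners for one utterance and three classes.
learner_probs = [
    [0.6, 0.3, 0.1],  # stand-in for SVM
    [0.2, 0.5, 0.3],  # stand-in for KNN
    [0.5, 0.4, 0.1],  # stand-in for XGBoost
    [0.3, 0.6, 0.1],  # stand-in for LightGBM
]
# Hypothetical raw scores; in the paper the weights are learned, not hand-set.
weights = softmax([1.0, 0.0, 1.0, 0.0])
predicted, combined = weighted_soft_vote(learner_probs, weights)
```

Because the weights sum to 1 and each learner outputs a probability vector, the combined vector is again a probability distribution over the classes.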

Response to the reviewer 1.
1. The literature study of this paper is poor. It is recommended to consider the recent literature and also to justify how the proposed work is better than the existing ones.
In this paper, we propose a method to improve the level of human-computer interaction. The main purpose of this method is to improve the performance of speech emotion recognition. The references in this paper are not the most recent, but they are relevant to the research theme of this paper. In addition, the experimental comparison in this paper is carried out on the CASIA data set (a Chinese emotion corpus).
Following the revision point, we searched the CSCD and SCI databases for recent research on speech emotion recognition based on the CASIA data set. The latest studies are mostly based on neural networks; their model complexity is high and their performance is not very good. For example, on the CASIA data set, "A novel heterogeneous parallel convolution Bi-LSTM for speech emotion recognition" reaches an accuracy of only 79.67%, "A novel user emotional interaction design model using long and short-term memory networks and deep learning" only 72.5%, "Speech emotion recognition based on transfer learning from the FaceNet framework" only 90%, and "Attention-based convolution skip bidirectional long short-term memory network for speech emotion recognition" only 72.5%. Therefore, we still consider the original references to be appropriate.

The feature section (Section 3) needs more detailed discussion.
In Section 3 (feature construction), we first introduced how to extract speech features (feature extraction), and then how to select a subset from the feature set (feature selection).
In the feature extraction subsection, we referred to much of the literature on speech signal processing. After an experimental comparison (not included in the paper), we selected the acoustic features used in the paper and calculated their statistical characteristics. How these acoustic features and their statistical characteristics are calculated is not discussed in detail in the paper, because the calculation processes can be found in the references.
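
As a rough illustration of what calculating the statistical characteristics of an acoustic feature means, the sketch below summarizes one frame-level contour with a few utterance-level statistics. The choice of statistics and the pitch values are assumptions made for this example; this letter does not list the exact features or statistics used in the paper.

```python
import statistics


def functionals(contour):
    """Summarize a frame-level feature contour with utterance-level statistics."""
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),  # population standard deviation
        "max": max(contour),
        "min": min(contour),
        "range": max(contour) - min(contour),
    }


# Hypothetical frame-level pitch values (Hz) for one short utterance.
pitch_contour = [210.0, 220.0, 230.0, 220.0, 210.0]
stats = functionals(pitch_contour)
```

Applying the same function to each acoustic feature yields a fixed-length vector per utterance, regardless of how many frames the utterance contains.
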
In the feature selection subsection, we mainly introduced the L-SFS feature selection method.

Following the revision point, we added "The CASIA Chinese emotional corpus was recorded by the Institute of Automation, Chinese Academy of Sciences. It includes four professional speakers and six emotions (anger, fear, happiness, neutral, sadness, and surprise), for a total of 7200 different utterances. For 300 of the items, the same text is read with different emotions, so these recordings can be used to compare the acoustic and rhythmic characteristics of different emotional states." to the paper.

The time complexity of the proposed algorithms should be estimated and compared with the existing models.
The main goal of this paper is to improve the performance of speech emotion recognition. We did not carry out a quantitative analysis of the time complexity of the model. From a qualitative perspective, however, the time complexity of the model in this paper is higher than that of a single classifier and lower than that of a neural-network-based classifier.

5. There are several machine-learning-based classification algorithms, but the authors studied very few in this paper. Why did the authors consider only a few? Provide the justification, or refer to "Machine learning algorithms for wireless sensor networks: a survey" for comparisons of various classification algorithms.
We know that there are many classification models based on machine learning algorithms, such as linear regression, Bayesian classifiers, decision trees, random forests, k-nearest neighbor, and support vector machines, but this paper selects k-nearest neighbor, support vector machine, XGBoost, and LightGBM. The reason is that, according to the many references we read, these four classifiers perform well in speech emotion classification, whereas the other classifiers do not.

Compare the model using recent existing algorithms.
As stated in the answer to point 1, most of the latest research is based on neural networks, and its performance on the CASIA data set is not as good as that of the method proposed in this paper. Therefore, we decided not to add this comparison to the paper.

List the limitations on proposed work.
1) The research proposed in this paper targets the CASIA data set, which is for Chinese speech emotion recognition. Therefore, it is limited with respect to cross-language emotional expression. 2) In this paper, the traditional acoustic feature extraction process relies on human effort and expertise, and there is still no complete feature set. In addition, the ensemble learning method achieves good results, but it is not clear whether increasing (or reducing) the number of single weak learners or changing the type of learner will improve the recognition accuracy. This remains to be tested further.

Response to the reviewer 2.

How does L-SFS help to extract and select features? This is missing.
In the feature selection subsection, we have introduced the steps of the L-SFS feature selection method in detail. In addition, in Subsection 5.4, we have verified this method experimentally.
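
For readers of this letter, the general idea of combining an importance ranking with forward selection can be sketched as follows. This is one plausible reading and not the exact L-SFS procedure, whose steps are given in the paper: here features are visited in descending order of an importance score (assumed to come from a fitted LightGBM model), and a feature is kept only if it improves a validation score. The function name, feature names, importances, and scoring function are all hypothetical.

```python
def importance_guided_sfs(importances, score_subset):
    """Forward selection that visits features in descending importance order.

    importances: dict mapping feature name -> importance score
                 (assumed to come from a fitted LightGBM model).
    score_subset: callable taking a list of feature names and returning a
                  validation score (higher is better).
    Returns (selected_features, best_score).
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    selected, best_score = [], float("-inf")
    for feature in ranked:
        candidate = selected + [feature]
        score = score_subset(candidate)
        if score > best_score:  # keep the feature only if it helps
            selected, best_score = candidate, score
    return selected, best_score


# Hypothetical importances and a toy scoring function, for illustration only.
importances = {"mfcc_mean": 0.5, "pitch_std": 0.3, "noise": 0.15, "energy_max": 0.05}
useful = {"mfcc_mean": 0.4, "pitch_std": 0.2, "energy_max": 0.1}


def toy_score(features):
    # Useful features raise the score; every feature costs a small penalty.
    return sum(useful.get(f, 0.0) for f in features) - 0.01 * len(features)


selected, best = importance_guided_sfs(importances, toy_score)
```

In practice, `score_subset` would train and validate a classifier on the candidate features; the toy function only stands in for that step.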

The dataset CASIA needs to be explored.
Following the revision point, we added "The CASIA Chinese emotional corpus was recorded by the Institute of Automation, Chinese Academy of Sciences. It includes four professional speakers and six emotions (anger, fear, happiness, neutral, sadness, and surprise), for a total of 7200 different utterances. For 300 of the items, the same text is read with different emotions, so these recordings can be used to compare the acoustic and rhythmic characteristics of different emotional states." to the paper.

The rationale for choosing SVM, KNN, LightGBM and XGBoost for constructing the proposed Sklex model.
We know that there are many classification models based on machine learning algorithms, such as linear regression, Bayesian classifiers, decision trees, random forests, k-nearest neighbor, and support vector machines, but this paper selects k-nearest neighbor, support vector machine, XGBoost, and LightGBM. The reason is that, according to the many references we read, these four classifiers perform well in speech emotion classification, whereas the other classifiers do not.

The research questions to be addressed by the proposed work and the research motivation need to be strengthened.

Why does ensembling four models give the best results?
Actually, we are not sure that selecting four classifiers achieves the best performance. At present, the classification results are satisfactory. Adding more classifiers may bloat the model and increase its time complexity, while removing classifiers will reduce the performance of the model.
In the Conclusion and Prospect section of the paper, we mentioned that "the ensemble learning method achieves good results, but it is not clear that increasing (reducing) the number of single weak learners and changing the type of learner will improve the recognition accuracy. So it remains to be further tested." Therefore, we cannot be sure that choosing four classifiers achieves the best performance.
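
One simple way to carry out such a test is to score every subset of learners on held-out data. The sketch below does this with an unweighted average-probability vote; the probability tables, labels, and function names are hypothetical, and the weights that Sklex learns are left out for brevity.

```python
from itertools import combinations


def soft_vote_accuracy(prob_tables, labels):
    """Accuracy of an unweighted average-probability vote over some learners.

    prob_tables: one table per learner; each table holds one probability
                 vector per sample.
    labels: true class index per sample.
    """
    correct = 0
    for i, label in enumerate(labels):
        n_classes = len(prob_tables[0][i])
        avg = [sum(t[i][c] for t in prob_tables) / len(prob_tables)
               for c in range(n_classes)]
        if max(range(n_classes), key=lambda c: avg[c]) == label:
            correct += 1
    return correct / len(labels)


def rank_subsets(learner_probs, labels):
    """Score every non-empty subset of learners, best (then smallest) first."""
    names = sorted(learner_probs)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            tables = [learner_probs[name] for name in subset]
            results.append((soft_vote_accuracy(tables, labels), subset))
    return sorted(results, key=lambda r: (-r[0], len(r[1]), r[1]))


# Hypothetical validation-set probabilities (3 samples, 2 classes).
learner_probs = {
    "svm": [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]],
    "knn": [[0.6, 0.4], [0.75, 0.25], [0.2, 0.8]],
    "xgboost": [[0.7, 0.3], [0.3, 0.7], [0.45, 0.55]],
}
labels = [0, 1, 1]
ranking = rank_subsets(learner_probs, labels)
```

With four learners there are only 15 non-empty subsets, so an exhaustive comparison of this kind is cheap once each learner's validation probabilities have been stored.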

What are the hyper-parameters used for the experiment?
Thank you very much for raising this question. We have revised the article accordingly, and the hyper-parameters are shown in the following tables.